Skip to content

TST: Use tempdir for cufile tests (and mark some as thread unsafe) #2218

Merged
seberg merged 2 commits into
NVIDIA:mainfrom
seberg:fix-cufile-tests
Jun 16, 2026
Merged

TST: Use tempdir for cufile tests (and mark some as thread unsafe) #2218
seberg merged 2 commits into
NVIDIA:mainfrom
seberg:fix-cufile-tests

Conversation

@seberg

@seberg seberg commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Description

Split out, since it also seems that changing these to use a temp_dir (necessary to run them
in parallel in multiple threads) also fixes the CI failures.

This PR also adds thread_unsafe markers to a few tests here (and to the pytest.ini, I have used the other one for other tests in the full changeset)

Towards gh-2194

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

seberg added 2 commits June 15, 2026 12:34
It seems that changing these to use a temp_dir (necessary to run them
in parallel in multiple threads) also fixes the CI failures.
@copy-pr-bot

copy-pr-bot Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@seberg seberg self-assigned this Jun 15, 2026
@github-actions github-actions Bot added the cuda.bindings Everything related to the cuda.bindings module label Jun 15, 2026
@seberg seberg added test Improvements or additions to tests P0 High priority - Must do! and removed test Improvements or additions to tests labels Jun 15, 2026
@seberg

seberg commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test b9f4c54

@github-actions

This comment has been minimized.

@seberg seberg added bug Something isn't working and removed bug Something isn't working labels Jun 15, 2026
@seberg seberg assigned seberg and unassigned seberg Jun 15, 2026
@seberg seberg added this to the cuda.bindings 13.3.1 milestone Jun 15, 2026

@mdboom mdboom left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but also pinging @sourabgupta3 for viz.

@seberg

seberg commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@sourabgupta3 if you have any thoughts, happy to follow up. But I think it's low-risk and I would like to not pile up related PRs.

Beyond using tmp_dir, it marks tests as thread-unsafe that use stats counters as those are global and it seems even stats_start()/stats_stop() aren't actually thread-safe but that seems unsurprising (i.e. if anything a documentation thing).

@seberg seberg merged commit 75e6960 into NVIDIA:main Jun 16, 2026
111 of 121 checks passed
@seberg seberg deleted the fix-cufile-tests branch June 16, 2026 09:07
@github-actions

This comment has been minimized.

1 similar comment
@github-actions

Copy link
Copy Markdown
Doc Preview CI
Preview removed because the pull request was closed or merged.

Comment on lines -121 to -125
xfail_handle_register = pytest.mark.xfail(
condition=isSupportedFilesystem() and os.environ.get("CI") is not None,
raises=cufile.cuFileError,
reason="handle_register call fails in CI for unknown reasons",
)

@leofang leofang Jun 17, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for catching up slowly. IIRC this decorator was first added in #1271 (there was a long discussion that seems not decipherable to me now) and is still tracked in #1307. Not sure if this rings any bell to @rwgk? Is there any QA system that would require this decorator?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bot told me that if @sourabgupta3 or @rsarpangalav could confirm that the actual root cause of us having to add xfail_handle_register was because the CWD being on a non-ext4 mount inside the container, then this PR would allow us to declare #1307 as resolved and close it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being impatient. This only affects os.environ.get("CI") is not None so I would think QA doesn't matter.

However... I glanced over isSupportedFilesystem() and that was silly. The temporary folder is not necessarily on the same file system after all. I'll follow up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened gh-2233 but maybe with some input (or revisiting) we can also figure out which check is missing from the supported filesystem check that would have made it reliable on CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.bindings Everything related to the cuda.bindings module P0 High priority - Must do! test Improvements or additions to tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants